For this project, I will attempt to classify handwritten digits using SVD's left-singular vectors as fundamental subspaces for each digit.

The approach will be to generate one distinctive subspace of orthogonal vectors - one U matrix for each class/digit - using Singular Value Decomposition. After creating these matrices, a classifier can be constructed to assign any new unknown image into a class (0, 1, 2, ... ,9).

In the first code block, I'm just importing some data for images. The dataset contains one column for each pixel (28x28 images, so 784 column vectors), plus one column for the label (as we can see in the front of the dataframe below).



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv("train.csv")
data.head()









    Out[1]:






  
    
      
      label
      pixel0
      pixel1
      pixel2
      pixel3
      pixel4
      pixel5
      pixel6
      pixel7
      pixel8
      ...
      pixel774
      pixel775
      pixel776
      pixel777
      pixel778
      pixel779
      pixel780
      pixel781
      pixel782
      pixel783
    
  
  
    
      0
       1
       0
       0
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      1
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      2
       1
       0
       0
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      3
       4
       0
       0
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      4
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
  

5 rows × 785 columns

So that's what the dataset looks like, each pixel is a column, each row is an image in the dataset. Actually I have a corresponding 'test' dataset but I'll just separate this one into train/test later on since it's big enough.

All values are grayscale 0-255. Here are some images below. I just reshaped each image vector into 28x28 for showing.



In [2]:

    
print data.iloc[1][0]
plt.imshow(data.iloc[1][1:].reshape(28,28),cmap='Greys')
plt.show()
print data.iloc[28][0]
plt.imshow(data.iloc[28][1:].reshape(28,28),cmap='Greys')
plt.show()



In [3]:

    
np.unique(data['label'],return_counts=True)









    Out[3]:





(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([4132, 4684, 4177, 4351, 4072, 3795, 4137, 4401, 4063, 4188]))

In the below bit of code, I made a function to separate the dataset into just 5 classes for demonstration purposes (only 0, 1, 2, 3, and 4). I also separated each of these digit datasets into a train and test dataset by assigning the first 80% of the dataset to train, and the rest for test. This is for each digit dataset.

I then compute the SVD for each train dataset so that I will have zero_u, one_u, ... , four_u. These will be the left singular vectors which I will use later for classification purposes.



In [14]:

    
def getTrainTest(digit):
    digit_train = digit[0:int(len(digit)*.8)]
    digit_train = digit_train[digit_train.columns[1:]].T
    digit_test = digit[int(len(digit)*.8):]
    digit_test = digit_test[digit_test.columns[1:]]
    return (digit_train,digit_test)

zero = data[data['label']==0]
zero_train,zero_test = getTrainTest(zero)

one = data[data['label']==1]
one_train,one_test = getTrainTest(one)

two = data[data['label']==2]
two_train,two_test = getTrainTest(two)

three = data[data['label']==3]
three_train,three_test = getTrainTest(three)

four = data[data['label']==4]
four_train,four_test = getTrainTest(four)

zero_u,e,v = np.linalg.svd(zero_train,full_matrices=False)
one_u,e,v = np.linalg.svd(one_train,full_matrices=False)
two_u,e,v = np.linalg.svd(two_train,full_matrices=False)
three_u,e,v = np.linalg.svd(three_train,full_matrices=False)
four_u,e,v = np.linalg.svd(four_train,full_matrices=False)

#Regarding full_matrices = False
#If True, U and Vh are of shape (M,M), (N,N).
#If False, the shapes are (M,K) and (K,N), where K = min(M,N).



In [15]:

    
print zero_u.shape
print e.shape
print v.shape









    



(784, 784)
(784,)
(784, 3257)

Seeing the singular vectors reconstructed is pretty cool. Especially the '0' one. See below for '3' and '0'.



In [16]:

    
for i in range(4):
    plt.imshow(three_u[:,i].reshape(28,28),cmap='Greys')#first 5 columns of U
    plt.show()



In [17]:

    
for i in range(4):
    plt.imshow(zero_u[:,i].reshape(28,28),cmap='Greys')#first 5 columns of U
    plt.show()

Below I have written a classification function and I just went through each of the test datasets to classify them as a digit. I used the U matrices of each of the digits (zero_u, one_u, etc.) as the 'classifier' for each class, and I iterated through them to get their resulting output. The minimum of the output is the class which is assigned by this function.

The function utilizes the residual given by $\lVert z - U\alpha \lVert _2$, and to minimize this, we just have $\alpha = U^Tz$. The idea behind it is that the linear combination of singular vectors is most appropriate for the digit which it is constructed from, so an unknown '0' image would have the smallest residual with the '0' $U$ matrix residual.



In [18]:

    
def classifyUnknownDigit(newDigit):
    classes = [zero_u,one_u,two_u,three_u,four_u]
    values = []
    for U in classes:
        values.append(np.linalg.norm((np.identity(len(U))-np.matrix(U)*np.matrix(U.T)).dot(newDigit),ord=2)/np.linalg.norm(newDigit,ord=2))
    return values.index(min(values))
    
zero_pred = []
one_pred = []
two_pred = []
three_pred = []
four_pred = []
for i in range(len(four_test)):
    four_pred.append(classifyUnknownDigit(four_test.iloc[i]))
for i in range(len(zero_test)):
    zero_pred.append(classifyUnknownDigit(zero_test.iloc[i]))
for i in range(len(two_test)):
    two_pred.append(classifyUnknownDigit(two_test.iloc[i]))
for i in range(len(one_test)):
    one_pred.append(classifyUnknownDigit(one_test.iloc[i]))
for i in range(len(three_test)):
    three_pred.append(classifyUnknownDigit(three_test.iloc[i]))
    
print "Accuracy"
print "------------"
print "0: ", zero_pred.count(0)/1.0/len(zero_pred) #count the number of 0's, divide by length of list to get accuracy. 
print "1: ", one_pred.count(1)/1.0/len(one_pred)
print "2: ", two_pred.count(2)/1.0/len(two_pred)
print "3: ", three_pred.count(3)/1.0/len(three_pred)









    



Accuracy
------------
0:  0.452237001209
1:  0.965848452508
2:  0.916267942584
3:  0.957520091848

Woops forgot to print the 4 predictions. The above computation takes a few minutes and really beats up the CPU.



In [19]:

    
print "4: ", four_pred.count(4)/1.0/len(four_pred)









    



4:  0.786503067485

It's interesting to see how some digits are classified well like 1, 2, and 3, whereas the 0 and 4 are not as great. Still, 95% is pretty bad for real-world applications, especially for something as important as sending mail to the right people. The article talks about just letting humans classify the digits manually if it's not completely certain about a prediction (if there is no large distinction between values among the classifier matrices).

The function below prints the classes and number of times each class was selected for the zero_pred list. We can see that 0 was classified correctly 374 times, and misclassified a lot as a 3, then as a 2, and almost never as a 1 or 4. I guess the curves of the 2 and the 3 might have to do with this since 0's are curvey as well.



In [20]:

    
np.unique(zero_pred,return_counts=True)









    Out[20]:





(array([0, 1, 2, 3, 4]), array([374,   5, 111, 335,   2]))

Similarly doing this for the four classifier, we see that it's pretty good but it's confusing 4's for 3's more than 1's,2's, and 0's. I guess the middle line of the '3' digit might be contributing to some confusion.



In [21]:

    
np.unique(four_pred,return_counts=True)









    Out[21]:





(array([0, 1, 2, 3, 4]), array([  1,  43,  45,  85, 641]))

So that's the gist of SVD classification. It's very slow, and not very accurate, but it looks pretty cool.

Allen S. Math 581 CSUN.



In [ ]:

	label	...
0	1	...
1	0	...
2	1	...
3	4	...
4	0	...

	label	...
0	1	...
1	0	...
2	1	...
3	4	...
4	0	...

	label	...
0	1	...
1	0	...
2	1	...
3	4	...
4	0	...